The CDH distribution of Spark ships without spark-sql. You can either recompile cloudera/spark yourself, or simply supply the missing jars at launch time.
OS: SUSE 11 SP3
Spark version: SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904
CDH version: CDH5.10.041
The cluster is deployed as the root user; we add the missing jars directly.
Download the Apache release of Spark.
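For example (the mirror URL is an assumption; any Apache Spark 2.1.0 binary distribution will do):

wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.6.tgz
tar -xzf spark-2.1.0-bin-hadoop2.6.tgz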
In the unpacked distribution, find the following two jars under the jars/ directory:
hive-cli-1.2.1.spark2.jar
spark-hive-thriftserver_2.11-2.1.0.jar
Upload them to each machine that will use spark-sql, then create the spark-sql launchers there.
Create a new spark-sql file under /opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/bin/:
vi /opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/bin/spark-sql
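A minimal sketch of what this outer wrapper can contain, modeled on the sibling wrappers CDH ships in the same bin/ directory (compare with e.g. ./spark2-shell on your parcel before relying on it):

#!/bin/bash
# Outer wrapper: resolve the parcel layout and delegate to the real
# launcher created in lib/spark2/bin below.
BIN_DIR="$(cd -P "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
LIB_DIR="$BIN_DIR/../lib"
exec "$LIB_DIR/spark2/bin/spark-sql" "$@"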
Then create a spark-sql under /opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/bin/ that pulls in the two jars uploaded earlier:
vi /opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/bin/spark-sql
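A sketch of this inner launcher, following the upstream Apache bin/spark-sql, which submits org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver; the /root/spark-sql-jars upload directory is an assumption, adjust it to wherever you actually placed the two jars:

#!/usr/bin/env bash
# Inner launcher: point SPARK_HOME at the parcel's Spark 2 tree.
export SPARK_HOME=/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2
# Hypothetical location of the two uploaded jars.
JARS=/root/spark-sql-jars
# Start the SQL CLI, adding the jars the CDH build leaves out.
exec "$SPARK_HOME"/bin/spark-submit \
  --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver \
  --jars "$JARS/hive-cli-1.2.1.spark2.jar,$JARS/spark-hive-thriftserver_2.11-2.1.0.jar" \
  "$@"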
You can also define an alias that points at the wrapper:
alias spark2-sql="/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/bin/spark-sql"
Add spark-sql to the log4j configuration.
Path: /opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/conf
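The exact entries are not spelled out here; a plausible starting point is the Hive-related loggers from the upstream Apache Spark log4j.properties.template, which keep spark-sql from flooding the console during metastore lookups:

# Settings to avoid noisy messages when looking up nonexistent UDFs
# in SparkSQL with Hive support (SPARK-9183).
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR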
spark-sql support on Spark 1.6
The approach is the same; only the jar differs. On 1.6 the missing piece is spark-assembly-1.6.1-hadoop2.6.0.jar.
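A hypothetical 1.6 launcher in the same spirit (the CDH parcel path and the jar's upload location are both assumptions; on 1.6 the assembly jar carries the SQL CLI classes):

#!/usr/bin/env bash
# Hypothetical Spark 1.6 variant of the launcher.
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
exec "$SPARK_HOME"/bin/spark-submit \
  --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver \
  --jars /root/spark-assembly-1.6.1-hadoop2.6.0.jar \
  "$@"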
SparkHiveServer
Recently, while evaluating Spark as a replacement for MapReduce to raise the platform's compute efficiency, we found that Spark SQL is fully compatible with Hive SQL and additionally provides a ThriftServer (effectively a Spark HiveServer). Better still, because Spark makes far better use of memory, its execution was more than 10x faster than MR/Hive for us.
In essence, you run $SPARK_HOME/sbin/start-thriftserver.sh --master=spark://MASTER:7077 on the Spark cluster, which opens port 10000 by default. This service can replace HiveServer2; if it shares a host with HiveServer2, shut HiveServer2 down first, then start the Spark ThriftServer. It ran for a week, the service was very stable, and GC was normal! The contents of our wrapper (cat start-thriftserver.sh) are shown below, after the notes on jars and Sentry.
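To check that the ThriftServer is accepting connections, beeline works against it exactly as against HiveServer2 (host and port taken from the command above):

beeline -u "jdbc:hive2://MASTER:10000" -e "show databases;"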
Later we found that it is enough to copy the missing jars into the parcel's jars/ directory:
cp activation-1.1.1.jar aopalliance-1.0.jar apacheds-i18n-2.0.0-M15.jar \
   apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar \
   base64-2.3.8.jar bcprov-jdk15on-1.51.jar commons-beanutils-1.7.0.jar \
   commons-beanutils-core-1.8.0.jar commons-configuration-1.6.jar commons-digester-1.8.jar \
   curator-client-2.6.0.jar curator-framework-2.6.0.jar curator-recipes-2.6.0.jar \
   guava-14.0.1.jar guice-3.0.jar guice-servlet-3.0.jar \
   hadoop-annotations-2.6.4.jar hadoop-auth-2.6.4.jar hadoop-client-2.6.4.jar \
   hadoop-common-2.6.4.jar hadoop-hdfs-2.6.4.jar \
   hadoop-mapreduce-client-app-2.6.4.jar hadoop-mapreduce-client-common-2.6.4.jar \
   hadoop-mapreduce-client-core-2.6.4.jar hadoop-mapreduce-client-jobclient-2.6.4.jar \
   hadoop-mapreduce-client-shuffle-2.6.4.jar \
   hadoop-yarn-api-2.6.4.jar hadoop-yarn-client-2.6.4.jar hadoop-yarn-common-2.6.4.jar \
   hadoop-yarn-server-common-2.6.4.jar hadoop-yarn-server-web-proxy-2.6.4.jar \
   hive-beeline-1.2.1.spark2.jar hive-exec-1.2.1.spark2.jar hive-jdbc-1.2.1.spark2.jar \
   htrace-core-3.0.4.jar httpclient-4.5.2.jar \
   jackson-core-asl-1.9.13.jar jackson-mapper-asl-1.9.13.jar java-xmlbuilder-1.0.jar \
   javax.inject-1.jar jaxb-api-2.2.2.jar jets3t-0.9.3.jar jetty-6.1.26.jar jsr305-1.3.9.jar \
   leveldbjni-all-1.8.jar mail-1.4.7.jar mesos-1.0.0-shaded-protobuf.jar mx4j-3.0.2.jar \
   protobuf-java-2.5.0.jar slf4j-api-1.7.16.jar slf4j-log4j12-1.7.16.jar \
   snappy-java-1.1.2.6.jar spark-mesos_2.11-2.1.1.jar stax-api-1.0-2.jar super-csv-2.2.0.jar \
   xercesImpl-2.9.1.jar xmlenc-0.52.jar zookeeper-3.4.6.jar \
   /opt/cloudera/parcels/SPARK2-2.0.0.cloudera2-1.cdh5.7.0.p0.118100/lib/spark2/jars/
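After the copy, a quick smoke test with the spark-sql wrapper created earlier confirms the CLI comes up:

spark-sql -e "show databases;"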
SparkSQL accesses its metadata via the HMS directly, and does not go through a HS2, so it does not truly get covered fully by Sentry. However, in a Sentry setup the HMS is write-protected via the Sentry Authz Plugin added on it, so DDLs are still protected against, but users can still view all metadata (i.e. they can run SHOW TABLES, SHOW DATABASES, etc. and retrieve full listing [1]).
With Sentry HMS plugin and Sentry HDFS ACL Sync enabled, access to tables’ data by Spark programs would be limited to the same rules as your Beeline/other Hive clients would.
The wrapper's exec line went through the following iterations:

exec $LIB_DIR/spark2/sbin/start-thriftserver.sh --name SparkJDBC --master yarn-client \
  --num-executors 10 --executor-memory 2g --executor-cores 4 \
  --driver-memory 10g --driver-cores 2 \
  --conf spark.storage.memoryFraction=0.2 \
  --conf spark.shuffle.memoryFraction=0.6 \
  --hiveconf hive.server2.thrift.port=10001 \
  --hiveconf hive.server2.logging.operation.enabled=true \
  --hiveconf hive.server2.authentication.kerberos.principal=hive/ddp-kettle-02.cmdmp.com@CMDMP.COM \
  --hiveconf hive.server2.authentication.kerberos.keytab=/root/hiveserver.keytab "$@"

exec $LIB_DIR/spark2/sbin/start-thriftserver.sh --name SparkJDBC --master yarn-client \
  --hiveconf hive.server2.thrift.port=10001 \
  --hiveconf hive.server2.logging.operation.enabled=true \
  --hiveconf hive.metastore.schema.verification=false \
  --hiveconf hive.server2.authentication.kerberos.principal=hive/ddp-kettle-02.cmdmp.com@CMDMP.COM \
  --hiveconf hive.server2.authentication.kerberos.keytab=/root/hiveserver.keytab "$@"

exec $LIB_DIR/spark2/sbin/start-thriftserver.sh --name SparkJDBC --master yarn-client \
  --hiveconf hive.server2.thrift.port=10000 \
  --hiveconf hive.server2.logging.operation.enabled=true \
  --hiveconf hive.metastore.schema.verification=false \
  --hiveconf spark.sql.thriftServer.incrementalCollect=true \
  --hiveconf hive.server2.authentication.kerberos.principal=hive/ddp-kettle-02.cmdmp.com@CMDMP.COM \
  --hiveconf hive.server2.authentication.kerberos.keytab=/root/hiveserver.keytab "$@"
--hiveconf hive.metastore.schema.verification=false
--hiveconf spark.sql.thriftServer.incrementalCollect=true
--hiveconf hive.support.concurrency=true
master: sets the Spark submit mode to yarn-client
hive.server2.thrift.port: the port the Thrift server listens on
hive.server2.authentication.kerberos.principal: the superuser principal the Thrift server starts as; here the superuser is hive
hive.server2.authentication.kerberos.keytab: the keytab matching that superuser principal
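With Kerberos enabled, a client session looks roughly like this (host, port, and service principal come from the commands above; the end-user principal is hypothetical, and the user must first kinit as someone Sentry/HDFS permits):

kinit someuser@CMDMP.COM
beeline -u "jdbc:hive2://ddp-kettle-02.cmdmp.com:10001/default;principal=hive/ddp-kettle-02.cmdmp.com@CMDMP.COM"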